Better support for classification tasks with large number of label classes #561
Conversation
…e list of labels by similarity to the prompt
# This is the VectorStore class that is used to store the embeddings and do a similarity search over.
VectorStoreWrapper(cache=False),
# This is the number of examples to produce.
k=10,
We should make this value of k configurable. Maybe 10 is a reasonable default, but we might want workflows in the future that automatically search for the right value of k.
done in f97a82d
The value of k can now be specified in the autolabel config file like so:
"label_selection": true,
"label_selection_count": 10
Did we benchmark this on Banking/Ledgar? I hope there isn't a big drop in performance using this approach.
# This is the list of labels available to select from.
label_examples,
# This is the embedding class used to produce embeddings which are used to measure semantic similarity.
OpenAIEmbeddings(),
Can we use the same embedding model as the one used for the seed examples? It can be read from the config.
+1 to read this from the embedding model section in Autolabel config
done in latest commit.
It now chooses the embedding function based on config.embedding_provider():
self.label_selector = LabelSelector.from_examples(
labels=self.config.labels_list(),
k=self.config.label_selection_count(),
embedding_func=PROVIDER_TO_MODEL.get(
self.config.embedding_provider(), DEFAULT_EMBEDDING_PROVIDER
)(),
)
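As a rough sketch of the dispatch pattern in the snippet above (the stand-in classes and the mapping's contents here are assumptions, not the real PROVIDER_TO_MODEL):

```python
class OpenAIEmbeddings:
    """Stand-in for the real OpenAI embedding class."""

class HuggingFaceEmbeddings:
    """Stand-in for an alternative provider's embedding class."""

# Hypothetical mapping from provider name to embedding class;
# unknown providers fall back to the default.
PROVIDER_TO_MODEL = {
    "openai": OpenAIEmbeddings,
    "huggingface": HuggingFaceEmbeddings,
}
DEFAULT_EMBEDDING_PROVIDER = OpenAIEmbeddings

def embedding_func_for(provider):
    # Mirrors PROVIDER_TO_MODEL.get(provider, DEFAULT_EMBEDDING_PROVIDER)()
    return PROVIDER_TO_MODEL.get(provider, DEFAULT_EMBEDDING_PROVIDER)()
```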
similar_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Input: {example}",
Use the example template from the config here? See how the seed examples are prepared.
This was unnecessary and has been removed in the latest commit.
)
label_examples = [{"input": label} for label in labels_list]

example_selector = SemanticSimilarityExampleSelector.from_examples(
Instead of creating this each time, is it possible to construct this example selector once and then just call sample labels each time? That would ensure we embed the label list only once.
Good idea.
I now construct the LabelSelector once in agent.run() and agent.plan(), so the label embeddings are only computed once.
# if large number of labels, filter labels_list by similarity of labels to input
if num_labels >= 50:
    example_prompt = PromptTemplate(
        input_variables=["input"],
nit: call this label?
removed in latest commit
# This is the embedding class used to produce embeddings which are used to measure semantic similarity.
OpenAIEmbeddings(),
# This is the VectorStore class that is used to store the embeddings and do a similarity search over.
VectorStoreWrapper(cache=False),
Went through the code; ideally we don't need to set this cache to False and can use the cache setting from the config, but if not, this would still be fine.
@@ -55,6 +60,40 @@ def construct_prompt(self, input: Dict, examples: List) -> str:
# prepare task guideline
labels_list = self.config.labels_list()
num_labels = len(labels_list)

# if large number of labels, filter labels_list by similarity of labels to input
if num_labels >= 50:
Can we do this based on a config setting? Just want to make sure we have the ability to turn this on or off. We can fall back to num_labels > 50 if the corresponding config setting is not set.
Agree with the previous comments: we should enable this "label selection" from a config parameter, not a hardcoded num_labels threshold.
done in f97a82d
Label selection can now be turned on/off in the config (as well as the number of labels to select), like so:
"label_selection": true,
"label_selection_count": 10
# This is the VectorStore class that is used to store the embeddings and do a similarity search over.
VectorStoreWrapper(cache=False),
# This is the number of examples to produce.
k=10,
Let's make this configurable.
done in latest commit.
My initial test on Ledgar: roughly the same, but I need to continue testing. Will try a full run on the Banking and Ledgar datasets.
split_lines = sampled_labels.split("\n")
labels_list = []
for i in range(1, len(split_lines)):
    if split_lines[i]:
        labels_list.append(split_lines[i])
This might break for input examples that contain newline characters (\n). Maybe I should check that split_lines[i] is in labels_list before appending.
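A sketch of that guard: keep only lines that are actual known labels, so stray fragments from inputs containing newlines are dropped (parse_sampled_labels is a hypothetical helper, not code from the PR):

```python
def parse_sampled_labels(sampled_labels, known_labels):
    # The membership check guards against input text that itself
    # contains "\n" and would otherwise leak into the label list.
    known = set(known_labels)
    return [line for line in sampled_labels.split("\n") if line in known]
```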
This should not be needed once the implementation is revamped
correct, this has been removed.
I suggest abstracting all this logic away into a "LabelSelector" class, similar to what the Example Selectors do.
The LabelSelector would be initialized once in the run/plan agent method by reading the appropriate fields from the config (again, very similar to the example selector).
The agent can then call this object's "select labels" function when labeling each example to get a list of the K most likely labels, like https://github.com/refuel-ai/autolabel/blob/main/src/autolabel/labeler.py#L183
and pass it to the task object (like https://github.com/refuel-ai/autolabel/blob/main/src/autolabel/labeler.py#L189).
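A minimal sketch of such a LabelSelector, assuming an injectable embedding function mapping a string to a vector; this is illustrative, not the actual implementation:

```python
import math

class LabelSelector:
    """Precompute label embeddings once; select the k most similar
    labels for each input via cosine similarity."""

    def __init__(self, labels, embedding_func, k=10):
        self.labels = labels
        self.k = k
        self.embedding_func = embedding_func
        # Label embeddings are computed once, at construction time.
        self.label_embeddings = [embedding_func(label) for label in labels]

    @staticmethod
    def _cos_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def select_labels(self, text):
        # Embed the (formatted) input and rank labels by similarity.
        emb = self.embedding_func(text)
        ranked = sorted(
            zip(self.labels, self.label_embeddings),
            key=lambda pair: self._cos_sim(emb, pair[1]),
            reverse=True,
        )
        return [label for label, _ in ranked[: self.k]]
```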
label_examples = [{"input": label} for label in labels_list]

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # This is the list of labels available to select from.
    label_examples,
    # This is the embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # This is the VectorStore class that is used to store the embeddings and do a similarity search over.
    VectorStoreWrapper(cache=False),
    # This is the number of examples to produce.
    k=10,
)
similar_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Input: {example}",
    suffix="",
    input_variables=["example"],
)
Please revamp this implementation.
- No need to go via FewShotExampleTemplate and the semantic similarity example selector.
- The label selection should conceptually consist of 3 steps: (i) input row --> formatted example, (ii) compute the embedding of the formatted example, (iii) find nearest neighbors from among the label list.
- The embeddings for labels in the label list should be computed just once, not once per row.
I have revamped the implementation.
The FewShotExampleTemplate and semantic similarity example selector have been removed entirely.
Embeddings for labels in the label list are computed only once, in agent.plan() and agent.run().
Worth noting that after refactoring this PR, I am noticing a slight (~5%) impact on labeling accuracy on the Ledgar dataset.
[Results tables for label_selection = true (k = 10) and label_selection = false omitted.]
Prior to the revamp, accuracy was about equal in both cases. Perhaps our cos_sim() function isn't quite as good as the langchain similarity selector I was using previously? That, or the embedding function is configured differently.
I have also noticed that embedding generation time is longer now than it was prior.
How I tested:
lgtm!
src/autolabel/labeler.py
Outdated
@@ -346,7 +376,13 @@ def plan(
)
else:
    examples = []
final_prompt = self.task.construct_prompt(input_i, examples)
if self.config.label_selection():
nit: this would give an error if label selection were set to true for any task other than classification, because construct_prompt has only been changed for the classification task. Any way to catch this, i.e. "label selection not supported for this task"?
Good catch. Will check that it is a classification task (if label_selection = true).
done in 2a6ec29
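The check might look something like this sketch (the function name and signature are assumptions, not the actual code from the commit):

```python
SUPPORTED_TASK_TYPES = {"classification"}

def validate_label_selection(task_type, label_selection):
    # Fail early: label selection is only wired into the classification task.
    if label_selection and task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(
            f"label_selection is not supported for task type '{task_type}'"
        )
```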
@@ -21,6 +21,11 @@

import json

from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
probably don't need these imports now?
removed these imports in e5ff12a
…es having OPENAI_API_KEY when importing autolabel
@iomap to follow up with any updates to documentation.
New option to filter labels_list by similarity to the input example.
Two new optional fields are now present in the prompt_config schema: "label_selection" and "label_selection_count".